Conversation

@ikawrakow
Owner

The implementation assumed that the K and V caches are contiguous and used this assumption to dequantize to fp16. This is certainly wrong for the V cache, which is just a view of the K cache with rows of 512 instead of 576 elements.
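To illustrate the problem (a minimal sketch, not the code in this PR): the ne/nb fields below mirror ggml's tensor-layout convention, and dequantize_row_to_f16() is a hypothetical stand-in for the type-specific quantized-to-fp16 conversion.

```cpp
#include <cstdint>
#include <cstddef>

// Simplified 2D tensor view in the spirit of ggml: ne[0] elements per row,
// ne[1] rows, nb[1] byte stride between consecutive rows.
struct tensor_view {
    int64_t      ne[2];
    size_t       nb[2];
    const void * data;
};

// Placeholder for the real type-specific quantized -> fp16 row conversion.
static void dequantize_row_to_f16(const void * /*src*/, uint16_t * dst, int64_t n_elems) {
    for (int64_t i = 0; i < n_elems; ++i) dst[i] = 0; // stand-in only
}

// Wrong for a view: treats all rows as one packed blob. This only works when
// the rows really are back-to-back in memory (as for the K cache).
static void dequantize_assuming_contiguous(const tensor_view & t, uint16_t * dst) {
    dequantize_row_to_f16(t.data, dst, t.ne[0] * t.ne[1]);
}

// Correct for a strided view: the V cache has 512-element rows, but consecutive
// rows are nb[1] bytes apart because they live inside 576-element K rows.
static void dequantize_strided(const tensor_view & t, uint16_t * dst) {
    for (int64_t row = 0; row < t.ne[1]; ++row) {
        const char * src_row = (const char *) t.data + row * t.nb[1];
        dequantize_row_to_f16(src_row, dst + row * t.ne[0], t.ne[0]);
    }
}
```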

@JohannesGaessler You may want to take a look at this PR. I don't think your PR in mainline llama.cpp can work for DeepSeek models with quantized KV cache.

A test session with this model:

./bin/llama-cli -m ./ds2.5/DeepSeek-V2.5-1210-IQ3_XXS-00001-of-00003.gguf -t 32 -ngl 100 -mla 3 -fa -c 32768 -s 1234 -ot exps=CPU -cnv -ctk q8_0 -ctv q8_0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes Log start main: build = 3673 (4084ca73) main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu main: seed = 1234 llama_model_loader: additional 2 GGUFs metadata loaded. llama_model_loader: loaded meta data with 53 key-value pairs and 959 tensors from ./ds2.5/DeepSeek-V2.5-1210-IQ3_XXS-00001-of-00003.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = deepseek2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek V2.5 1210 llama_model_loader: - kv 3: general.version str = V2.5-1210 llama_model_loader: - kv 4: general.basename str = DeepSeek llama_model_loader: - kv 5: general.size_label str = 160x14B llama_model_loader: - kv 6: general.license str = other llama_model_loader: - kv 7: general.license.name str = deepseek llama_model_loader: - kv 8: general.license.link str = https://github.com/deepseek-ai/DeepSe... llama_model_loader: - kv 9: deepseek2.block_count u32 = 60 llama_model_loader: - kv 10: deepseek2.context_length u32 = 163840 llama_model_loader: - kv 11: deepseek2.embedding_length u32 = 5120 llama_model_loader: - kv 12: deepseek2.feed_forward_length u32 = 12288 llama_model_loader: - kv 13: deepseek2.attention.head_count u32 = 128 llama_model_loader: - kv 14: deepseek2.attention.head_count_kv u32 = 128 llama_model_loader: - kv 15: deepseek2.rope.freq_base f32 = 10000,000000 llama_model_loader: - kv 16: deepseek2.attention.layer_norm_rms_epsilon f32 = 0,000001 llama_model_loader: - kv 17: deepseek2.expert_used_count u32 = 6 llama_model_loader: - kv 18: general.file_type u32 = 23 llama_model_loader: - kv 19: deepseek2.leading_dense_block_count u32 = 1 llama_model_loader: - kv 20: deepseek2.vocab_size u32 = 102400 llama_model_loader: - kv 21: deepseek2.attention.q_lora_rank u32 = 1536 llama_model_loader: - kv 22: deepseek2.attention.kv_lora_rank u32 = 512 llama_model_loader: - kv 23: deepseek2.attention.key_length u32 = 192 llama_model_loader: - kv 24: deepseek2.attention.value_length u32 = 128 llama_model_loader: - kv 25: deepseek2.expert_feed_forward_length u32 = 1536 llama_model_loader: - kv 26: deepseek2.expert_count u32 = 160 llama_model_loader: - kv 27: deepseek2.expert_shared_count u32 = 2 llama_model_loader: - kv 28: deepseek2.expert_weights_scale f32 = 16,000000 llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64 llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40,000000 llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096 llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0,100000 llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-llm llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,102400] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,102400] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,99757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e... 
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 100000 llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 100001 llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 100001 llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de... llama_model_loader: - kv 45: general.quantization_version u32 = 2 llama_model_loader: - kv 46: quantize.imatrix.file str = /models_out/DeepSeek-V2.5-1210-GGUF/D... llama_model_loader: - kv 47: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 716 llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 139 llama_model_loader: - kv 50: split.no u16 = 0 llama_model_loader: - kv 51: split.count u16 = 3 llama_model_loader: - kv 52: split.tensors.count i32 = 959 llama_model_loader: - type f32: 300 tensors llama_model_loader: - type q5_K: 1 tensors llama_model_loader: - type iq3_xxs: 597 tensors llama_model_loader: - type iq3_s: 61 tensors llm_load_vocab: special tokens cache size = 18 llm_load_vocab: token to piece cache size = 0,6411 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = deepseek2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 102400 llm_load_print_meta: n_merges = 99757 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 163840 llm_load_print_meta: n_embd = 5120 llm_load_print_meta: n_layer = 60 llm_load_print_meta: n_head = 128 llm_load_print_meta: n_head_kv = 128 llm_load_print_meta: n_rot = 64 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_swa_pattern = 1 llm_load_print_meta: n_embd_head_k = 192 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 24576 llm_load_print_meta: n_embd_v_gqa = 16384 llm_load_print_meta: f_norm_eps = 0,0e+00 llm_load_print_meta: f_norm_rms_eps = 1,0e-06 llm_load_print_meta: f_clamp_kqv = 0,0e+00 llm_load_print_meta: f_max_alibi_bias = 0,0e+00 llm_load_print_meta: f_logit_scale = 0,0e+00 llm_load_print_meta: n_ff = 12288 llm_load_print_meta: n_expert = 160 llm_load_print_meta: n_expert_used = 6 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = yarn llm_load_print_meta: freq_base_train = 10000,0 llm_load_print_meta: freq_scale_train = 0,025 llm_load_print_meta: n_ctx_orig_yarn = 4096 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 236B llm_load_print_meta: model ftype = IQ3_XXS - 3.0625 bpw llm_load_print_meta: model params = 235,741 B llm_load_print_meta: model size = 84,604 GiB (3,083 BPW) llm_load_print_meta: repeating layers = 84,058 GiB (3,077 BPW, 234,693 B parameters) llm_load_print_meta: general.name = DeepSeek V2.5 1210 llm_load_print_meta: BOS token = 100000 '<|begin▁of▁sentence|>' llm_load_print_meta: EOS token = 100001 '<|end▁of▁sentence|>' llm_load_print_meta: PAD token = 100001 '<|end▁of▁sentence|>' llm_load_print_meta: LF token = 126 'Ä' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_layer_dense_lead = 1 llm_load_print_meta: n_lora_q = 1536 llm_load_print_meta: n_lora_kv = 512 
llm_load_print_meta: n_ff_exp = 1536 llm_load_print_meta: n_expert_shared = 2 llm_load_print_meta: expert_weights_scale = 16,0 llm_load_print_meta: expert_weights_norm = 0 llm_load_print_meta: expert_gating_func = softmax llm_load_print_meta: rope_yarn_log_mul = 0,1000 llm_load_tensors: ggml ctx size = 0,80 MiB Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.1.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.1.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.2.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.2.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.18.ffn_gate_exps.weight 
buffer type overriden to CPU Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.36.ffn_up_exps.weight buffer 
type overriden to CPU Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.55.ffn_down_exps.weight buffer type 
overriden to CPU Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU llm_load_tensors: offloading 60 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 61/61 layers to GPU llm_load_tensors: CPU buffer size = 37343,30 MiB llm_load_tensors: CPU buffer size = 37866,68 MiB llm_load_tensors: CPU buffer size = 10656,64 MiB llm_load_tensors: CPU buffer size = 214,84 MiB llm_load_tensors: CUDA0 buffer size = 5109,97 MiB .................................................................................................... ============ llm_load_tensors: need to compute 60 wk_b tensors Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.1.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.2.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.3.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.4.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.5.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.6.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.7.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.8.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.9.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.10.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.11.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.12.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.13.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.14.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.15.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.16.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.17.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.18.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.19.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.20.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.21.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.22.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.23.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.24.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.25.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.26.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.27.attn_v_b.weight as 128 x 512 x 128 
and stored in buffer CUDA0 Computed blk.28.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.29.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.30.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.31.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.32.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.33.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.34.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.35.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.36.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.37.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.38.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.39.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.40.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.41.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.42.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.43.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.44.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.45.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.46.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.47.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.48.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.49.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.50.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.51.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.52.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.53.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.54.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.55.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.56.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.57.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.58.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 llama_new_context_with_model: n_ctx = 32768 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: mla_attn = 3 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 10000,0 llama_new_context_with_model: freq_scale = 0,025 llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 
64, kv_lora_rank = 512 llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 54: 
n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: CUDA0 KV buffer size = 1147,53 MiB llama_new_context_with_model: KV self size = 1147,50 MiB, c^KV (q8_0): 1147,50 MiB, kv^T: not used llama_new_context_with_model: CUDA_Host output buffer size = 0,39 MiB llama_new_context_with_model: CUDA0 compute buffer size = 6346,00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 74,01 MiB llama_new_context_with_model: graph nodes = 3350 llama_new_context_with_model: graph splits = 179 main: chat template example: You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>

system_info: n_threads = 32 / 64 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 32768, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to the AI.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.
> Count from 1 to 10 in French

Sure! Here's how you count from 1 to 10 in French:

  1. Un
  2. Deux
  3. Trois
  4. Quatre
  5. Cinq
  6. Six
  7. Sept
  8. Huit
  9. Neuf
  10. Dix

> Tell me everything you know about radiation therapy of cancer

Radiation therapy, also known as radiotherapy, is a medical treatment that uses high doses of radiation to kill cancer cells and shrink tumors. Here’s an overview of everything you need to know about radiation therapy for cancer:

Types of Radiation Therapy

  1. External Beam Radiation Therapy (EBRT):

    • Description: Radiation is delivered from a machine outside the body, targeting the cancerous tumor.
    • Common Techniques:
      • 3D Conformal Radiation Therapy: Uses multiple radiation beams to shape the treatment to the tumor’s 3D shape.
      • Intensity-Modulated Radiation Therapy (IMRT): Delivers varying doses of radiation to different parts of the tumor, reducing damage to nearby healthy tissues.
      • Proton Therapy: Uses protons instead of X-rays, allowing for precise targeting of the tumor with minimal radiation to surrounding tissues.
      • Stereotactic Radiosurgery (SRS) and Stereotactic Body Radiation Therapy (SBRT): High-precision techniques used for small tumors or lesions, often in the brain or lung.
  2. Internal Radiation Therapy (Brachytherapy):

    • Description: Radioactive sources are placed inside the body, either temporarily or permanently, directly into or near the tumor.
    • Types:
      • High Dose Rate (HDR) Brachytherapy: Temporary placement of radioactive material for a short period.
      • Low Dose Rate (LDR) Brachytherapy: Permanent placement of radioactive seeds, commonly used for prostate cancer.
  3. Systemic Radiation Therapy:

    • Description: Radioactive substances are administered through the bloodstream, targeting cancer cells throughout the body.
    • Examples:
      • Radioactive iodine (I-131) for thyroid cancer.
      • Lutetium-177 (Lu-177) or Yttrium-90 (Y-90) for neuroendocrine tumors.

Purpose of Radiation Therapy

  1. Cancer Treatment:

    • Curative Intent: To eliminate the cancer completely, often used in early-stage cancers.
    • Palliative Treatment: To relieve symptoms and improve quality of life for advanced-stage cancers.
    • Adjuvant Therapy: Used after surgery to eliminate any remaining cancer cells.
    • Neoadjuvant Therapy: Used before surgery to shrink the tumor, making surgery easier and potentially reducing the extent of surgery needed.
  2. Prevention of Recurrence:

    • Radiation may be used to reduce the risk of cancer recurrence in high-risk patients.

Side Effects of Radiation Therapy

  1. Acute Side Effects:

    • Skin Reactions: Redness, irritation, and peeling.
    • Fatigue: Common and often temporary.
    • Swelling or Edema: Due to fluid accumulation in treated areas.
    • Mucositis: Inflammation of the mucous membranes, common in head and neck cancers.
  2. Late Side Effects:

    • Tissue Fibrosis: Scarring of the irradiated tissues.
    • Lymphedema: Swelling due to damage to the lymphatic system.
    • Cognitive Changes: Rare, but possible in brain radiation.
    • Secondary Cancers: Extremely rare but possible years after treatment.

Factors Influencing Radiation Therapy

  1. Tumor Characteristics:

    • Type, size, and location of the tumor.
    • Whether the tumor is localized or has spread (metastatic).
  2. Patient Characteristics:

    • Age and overall health.
    • Previous cancer treatments and their outcomes.
    • Genetic factors and cancer predisposition.
  3. Radiation Techniques and Doses:

    • The type of radiation used and the dose administered.
    • The fractionation schedule (how the dose is divided over time).

Advances in Radiation Therapy

  1. Imaging and Targeting:

    • Advanced imaging techniques like PET scans help in precisely targeting the tumor.
    • Image-guided radiation therapy (IGRT) uses real-time imaging to adjust the radiation beam during treatment.
  2. Personalized Medicine:

    • Customized treatment plans based on the individual patient’s tumor characteristics and genetics.
  3. Integration with Other Treatments:

    • Combination with chemotherapy, immunotherapy, or surgery for better outcomes.

Conclusion

Radiation therapy is a powerful tool in the fight against cancer, offering curative potential, symptom relief, and preventive measures. Advances in technology continue to improve its precision and reduce side effects, making it an essential component of modern cancer care.

> How many r's are there in strawberry

The word "strawberry" contains 3 r's.

> Can you solve the following system of equations for me? x + y = 2 x - y = 3

Certainly! Let's solve the system of equations step by step:

We are given two equations:

  1. ( x + y = 2 )
  2. ( x - y = 3 )

Step 1: Add the two equations

Adding the left-hand sides:
[ (x + y) + (x - y) = 2 + 3 ]
[ 2x = 5 ]

Step 2: Solve for ( x )

[ 2x = 5 ]
[ x = \frac{5}{2} ]
[ x = 2.5 ]

Step 3: Substitute ( x = 2.5 ) into one of the original equations to find ( y )

Let's use the first equation:
[ x + y = 2 ]
[ 2.5 + y = 2 ]
[ y = 2 - 2.5 ]
[ y = -0.5 ]

Final Solution

The solution to the system of equations is:
[ x = 2.5 ]
[ y = -0.5 ]

> Can you explain the German electoral system?

Certainly! The German electoral system is characterized by a mixed-member proportional representation (MMP) system, which combines elements of both constituency-based voting and proportional representation. Here’s an overview of how it works:

Electoral System Overview

  1. Electoral Districts:

    • Germany is divided into 299 single-member constituencies (Wahlkreise), each represented by one directly elected Member of Parliament (MP).
    • Voters cast two votes: one for a candidate in their local constituency (Erststimme or first vote) and one for a political party (Zweitstimme or second vote).
  2. Directly Elected MPs (Erststimme):

    • The candidate with the most votes in each constituency wins a seat in the Bundestag (Federal Parliament).
  3. Party List MPs (Zweitstimme):

    • The second vote determines the proportional representation of parties in the Bundestag.
    • Each party’s share of the total second votes determines how many additional seats they receive beyond the directly elected MPs.

Seat Allocation Process

  1. Initial Allocation:

    • Directly elected MPs from the constituencies are first seated.
    • The remaining seats are filled based on the proportion of second votes each party receives.
  2. Overhang Seats:

    • If a party wins significantly more constituency seats than its share of the second votes would allow, additional "overhang" seats are added to maintain proportionality.
  3. Compensatory Seats:

    • To ensure overall proportionality, compensatory seats are allocated to parties that did not win enough constituency seats to match their share of the second votes.

Formation of the Bundestag

  1. Minimum Threshold:

    • Parties must receive at least 5% of the second votes nationwide or win at least three constituency seats to enter the Bundestag.
  2. Government Formation:

    • The party or coalition with the majority of seats forms the government.
    • If no single party has a majority, a coalition government is formed between two or more parties.

Key Features

  1. Stability:

    • The system promotes stable government formation by allowing for coalition-building among parties with similar political platforms.
  2. Representation:

    • Proportional representation ensures that smaller parties have a voice in the Bundestag, reflecting the diversity of political views in the electorate.
  3. Direct Representation:

    • Voters have a personal connection to their directly elected MP, enhancing local representation.

Conclusion

The German electoral system is designed to balance local representation with proportional representation of political parties, ensuring a stable and representative government. This hybrid system has been successful in maintaining political stability while reflecting the diverse political landscape of Germany.

> Can you find the encoding rule for the following example: Encoded text: oyfjdnisdr rtqwainr acxz mynzbhhx Decoded text: Think step by step

To decode the encoded text "oyfjdnisdr rtqwainr acxz mynzbhhx" to "Think step by step," we need to find the encoding rule. Given the nature of the encoded text and the decoded text, it seems likely that the encoding involves a simple substitution cipher, where each letter in the plaintext is replaced by a different letter.

Let's break down the encoded text and compare it with the decoded text:

Encoded Text:
oyfjdnisdr rtqwainr acxz mynzbhhx

Decoded Text:
Think step by step

Step 1: Analyze the Lengths

  • "Think step by step" has 16 characters.
  • "oyfjdnisdr rtqwainr acxz mynzbhhx" has 32 characters.

This suggests that each letter in the decoded text corresponds to two letters in the encoded text.

Step 2: Identify Letter Pairs

Let's align the letters:

Decoded Text:
T h i n k s t e p b y s t e p

Encoded Text:
o y f j d n i s d r r t q w a i n r a c x z m y n z b h h x

Step 3: Match Letters

Now, let's match each pair of letters from the encoded text to a single letter in the decoded text:

  • T → oy
  • h → fj
  • i → dn
  • n → is
  • k → dr
  • s → rt
  • t → qw
  • e → ai
  • p → nr
  • b → ac
  • y → xz
  • s → my
  • t → nz
  • e → bh
  • p → hx

Step 4: Formulate the Encoding Rule

From the above matching, we can infer that the encoding rule is:

  • Each letter in the decoded text corresponds to two letters in the encoded text.
  • The mapping is not a simple substitution but rather a pairing of letters.

Conclusion

The encoding rule for this example is that each letter in the decoded text is represented by a pair of letters in the encoded text. The specific mapping can be derived from the given example:

  • T → oy
  • h → fj
  • i → dn
  • n → is
  • k → dr
  • s → rt
  • t → qw
  • e → ai
  • p → nr
  • b → ac
  • y → xz
  • s → my
  • t → nz
  • e → bh
  • p → hx

This rule can be used to decode other similarly encoded texts.

llama_print_timings: load time = 11641,32 ms
llama_print_timings: sample time = 135,78 ms / 2822 runs ( 0,05 ms per token, 20782,86 tokens per second)
llama_print_timings: prompt eval time = 24467,67 ms / 126 tokens ( 194,19 ms per token, 5,15 tokens per second)
llama_print_timings: eval time = 224627,46 ms / 2816 runs ( 79,77 ms per token, 12,54 tokens per second)
llama_print_timings: total time = 517022,50 ms / 2942 tokens

Here is a quick sweep-bench performance test.

fp16 KV cache

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 14.243 | 143.79 | 39.607 | 12.93 |
| 2048 | 512 | 2048 | 14.741 | 138.93 | 40.155 | 12.75 |
| 2048 | 512 | 4096 | 15.250 | 134.29 | 40.546 | 12.63 |
| 2048 | 512 | 6144 | 15.778 | 129.80 | 41.711 | 12.27 |
| 2048 | 512 | 8192 | 16.303 | 125.62 | 41.891 | 12.22 |
| 2048 | 512 | 10240 | 16.847 | 121.57 | 42.925 | 11.93 |
| 2048 | 512 | 12288 | 17.497 | 117.05 | 43.123 | 11.87 |
| 2048 | 512 | 14336 | 17.874 | 114.58 | 43.521 | 11.76 |

Q8_0 KV cache

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 14.284 | 143.38 | 39.549 | 12.95 |
| 2048 | 512 | 2048 | 14.795 | 138.42 | 40.182 | 12.74 |
| 2048 | 512 | 4096 | 15.379 | 133.17 | 40.770 | 12.56 |
| 2048 | 512 | 6144 | 18.119 | 113.03 | 42.032 | 12.18 |
| 2048 | 512 | 8192 | 16.466 | 124.38 | 42.423 | 12.07 |
| 2048 | 512 | 10240 | 16.945 | 120.86 | 43.506 | 11.77 |
| 2048 | 512 | 12288 | 17.601 | 116.35 | 43.925 | 11.66 |
| 2048 | 512 | 14336 | 17.987 | 113.86 | 44.597 | 11.48 |

I.e., only very slightly slower than the fp16 KV cache. The KV cache is quite small with FlashMLA-3, but if one wants to go to 160k tokens with DeepSeek-V3/R1, using a Q8_0 KV cache instead of fp16 may make the difference between being able and not being able to run on a single 24 GB GPU.
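As a rough cross-check (not from the PR itself), the arithmetic below reproduces the KV self size reported in the log above: with MLA only c^KV is cached, i.e. kv_lora_rank + n_embd_head_qk_rope = 512 + 64 = 576 elements per token per layer, this model has 60 layers, fp16 takes 2 bytes per element, and Q8_0 takes 34 bytes per 32-element block. DeepSeek-V3/R1 has a few more layers, so its cache is proportionally larger.

```cpp
#include <cstdio>

int main() {
    const double elems_per_tok_layer = 512 + 64;    // kv_lora_rank + n_embd_head_qk_rope
    const int    n_layer             = 60;          // DeepSeek-V2.5
    const double bytes_f16           = 2.0;         // fp16: 2 bytes per element
    const double bytes_q8_0          = 34.0 / 32.0; // q8_0: 34-byte block per 32 elements

    const long ctxs[2] = {32768, 163840};
    for (long n_ctx : ctxs) {
        const double base = elems_per_tok_layer * n_layer * (double) n_ctx / (1024.0 * 1024.0);
        printf("n_ctx = %6ld: fp16 = %8.1f MiB, q8_0 = %7.1f MiB\n",
               n_ctx, base * bytes_f16, base * bytes_q8_0);
    }
    return 0;
}
// n_ctx =  32768: fp16 =   2160.0 MiB, q8_0 =  1147.5 MiB  (matches the 1147.50 MiB in the log)
// n_ctx = 163840: fp16 =  10800.0 MiB, q8_0 =  5737.5 MiB
```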

@JohannesGaessler

Thank you for notifying me. I am aware of the defect; on the mainline PR it is currently not manifesting as a bug because the K and V cache are not yet deduplicated and are thus both contiguous in memory. I can't comment on the specific code in this PR since I won't look at it unless you explicitly tell me I'm allowed to do so even without the conflict between you and Georgi first being resolved. The way I would have gone about it would have been not to use the V tensor at all, to dequantize K, and to then calculate the pointer, dimensions, and strides for a pseudo V tensor from the K tensor.
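For concreteness, a minimal sketch of what I have in mind, assuming K has already been dequantized once into a contiguous fp16 buffer with 576-element rows; the f16_view struct and function name are illustrative only, not ggml API.

```cpp
#include <cstdint>
#include <cstddef>

// Minimal fp16 view: ne[0] elements per row, ne[1] rows, nb[] byte strides.
struct f16_view {
    void *  data;
    int64_t ne[2];
    size_t  nb[2];
};

// Build a pseudo V tensor from an already dequantized, contiguous fp16 K buffer.
// For DeepSeek MLA: k_row_elems = 576 (512 + 64 rope), v_row_elems = 512.
static f16_view make_pseudo_v_from_k(void * k_f16, int64_t n_rows,
                                     int64_t k_row_elems, int64_t v_row_elems) {
    f16_view v;
    v.data  = k_f16;                          // same buffer, no copy
    v.ne[0] = v_row_elems;                    // only the first 512 elements of each row
    v.ne[1] = n_rows;
    v.nb[0] = sizeof(uint16_t);               // fp16 element stride
    v.nb[1] = k_row_elems * sizeof(uint16_t); // row stride is the full K row (576 * 2 bytes)
    return v;
}
```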

@ikawrakow
Owner Author

Forgot to add -rtr in the above performance test. Here it is with -rtr and q8_0 KV cache:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 13.348 | 153.43 | 36.662 | 13.97 |
| 2048 | 512 | 2048 | 14.637 | 139.92 | 37.208 | 13.76 |
| 2048 | 512 | 4096 | 14.478 | 141.46 | 37.720 | 13.57 |
| 2048 | 512 | 6144 | 14.880 | 137.64 | 39.034 | 13.12 |
| 2048 | 512 | 8192 | 16.081 | 127.36 | 39.282 | 13.03 |
| 2048 | 512 | 10240 | 16.240 | 126.11 | 40.409 | 12.67 |
| 2048 | 512 | 12288 | 17.001 | 120.47 | 40.805 | 12.55 |
| 2048 | 512 | 14336 | 18.056 | 113.42 | 41.437 | 12.36 |

@ikawrakow
Owner Author

> on the mainline PR it is currently not manifesting as a bug because the K and V cache are not yet deduplicated and are thus both contiguous in memory.

Oh, yes, I forgot about that.

In any case, the PR in ik_llama.cpp is mostly a copy of your mainline PR, so looking at the code you wrote in my repository hopefully does not break Georgi's rules.

@JohannesGaessler

My concern specifically is whether you would consider any of my work on mainline, after I have looked at your code, to include a "substantial portion" of your work and thus to be includable only in conjunction with the copyright notices in ik_llama.cpp. Much like you, I am not a lawyer, but if you tell me that you will not consider me looking at your work to be a license violation (or that in some specific case you waive the requirement of copyright notices), then there is no need for lawyers in the first place.
